19 research outputs found

    Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

    We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations. Comment: Accepted at the 12th International Conference on Web and Social Media (ICWSM), 2018.
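    To make the bootstrapping paradigm concrete, here is a minimal sketch of the extraction loop in Python. The seed regex, corpus interface, and helper functions are illustrative simplifications, not the authors' released implementation.

```python
import re

# One seed pattern of the form ["Q", said S.]; groups Q and S capture the
# quotation and the speaker. Illustrative only.
SEED_PATTERNS = {r'"(?P<Q>[^"]+)",\s*said\s+(?P<S>[A-Z]\w+(?:\s+[A-Z]\w+)*)'}

def extract_pairs(sentences, patterns):
    """Apply every pattern to every sentence and collect (quotation, speaker) pairs."""
    pairs = set()
    for sentence in sentences:
        for pattern in patterns:
            for match in re.finditer(pattern, sentence):
                pairs.add((match.group("Q"), match.group("S")))
    return pairs

def induce_patterns(sentences, pairs):
    """Turn sentences containing a known quotation and its speaker into new patterns."""
    patterns = set()
    for quotation, speaker in pairs:
        for sentence in sentences:
            if quotation in sentence and speaker in sentence:
                pattern = re.escape(sentence)
                pattern = pattern.replace(re.escape(quotation), '(?P<Q>[^"]+)')
                pattern = pattern.replace(re.escape(speaker), r'(?P<S>[A-Z]\w+(?:\s+[A-Z]\w+)*)')
                patterns.add(pattern)
    return patterns

def quootstrap(sentences, iterations=3):
    """Alternate between extracting pairs and discovering new patterns."""
    patterns = set(SEED_PATTERNS)
    pairs = set()
    for _ in range(iterations):
        pairs |= extract_pairs(sentences, patterns)
        patterns |= induce_patterns(sentences, pairs)
    return pairs
```

    A real system would additionally need to deduplicate quotations across articles and filter out unreliable induced patterns, which this toy loop omits.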

    Real-time gesture recognition using machine learning techniques

    Gesture recognition is a research topic that has been gaining ever more popularity, especially in recent years, thanks to technological advances in embedded devices and sensors. The aim of this thesis is to use machine learning techniques to build a system capable of recognizing and classifying hand gestures in real time from the myoelectric (EMG) signals produced by the muscles. In addition, to enable the recognition of complex spatial movements, inertial signals are also processed, coming from an Inertial Measurement Unit (IMU) equipped with an accelerometer, a gyroscope, and a magnetometer. The first part of the thesis, besides offering an overview of wearable devices and sensors, analyzes several techniques for the classification of temporal sequences, highlighting their advantages and disadvantages. In particular, it considers approaches based on Dynamic Time Warping (DTW), Hidden Markov Models (HMM), and Long Short-Term Memory (LSTM) recurrent neural networks (RNN), one of the most recent developments in deep learning. The second part concerns the project itself. The Myo wearable device by Thalmic Labs is used as a case study, and the DTW- and HMM-based techniques are applied in detail to design and build a framework capable of performing real-time gesture recognition. The final chapter presents the results obtained (including a comparison between the analyzed techniques), both for the classification of isolated gestures and for real-time recognition.
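    As a rough illustration of the DTW-based approach discussed in the thesis, the sketch below classifies a multichannel EMG/IMU sequence by nearest-neighbour search under dynamic time warping; the feature layout and template set are hypothetical, not the thesis code.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two sequences of feature vectors, shaped (T1, D) and (T2, D)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance between frames
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def classify(query, templates):
    """templates: list of (label, sequence) pairs recorded for each gesture class."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]
```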

    Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion

    Neural Radiance Fields (NeRF) coupled with GANs represent a promising direction in the area of 3D reconstruction from a single view, owing to their ability to efficiently model arbitrary topologies. Recent work in this area, however, has mostly focused on synthetic datasets where exact ground-truth poses are known, and has overlooked pose estimation, which is important for certain downstream applications such as augmented reality (AR) and robotics. We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available. Our approach recovers an SDF-parameterized 3D shape, pose, and appearance from a single image of an object, without exploiting multiple views during training. More specifically, we leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution, which is then refined via optimization. Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios. We demonstrate state-of-the-art results on a variety of real and synthetic benchmarks.
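    The hybrid inversion scheme lends itself to a compact sketch: an encoder provides an initial estimate of the latent code and pose, and a short optimization loop then refines it against the input image. The PyTorch snippet below is a hedged approximation; `encoder`, `generator`, and the reconstruction loss are placeholders rather than the paper's actual networks and objectives.

```python
import torch

def invert(image, encoder, generator, steps=10, lr=1e-2):
    """Hybrid inversion: encoder guess followed by a few optimization steps."""
    with torch.no_grad():
        latent, pose = encoder(image)       # first guess of the shape/appearance code and pose
    latent = latent.clone().requires_grad_(True)
    pose = pose.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent, pose], lr=lr)
    for _ in range(steps):                  # "as few as 10 steps" per the abstract
        opt.zero_grad()
        rendered = generator(latent, pose)  # differentiable rendering of the generated object
        loss = torch.nn.functional.mse_loss(rendered, image)
        loss.backward()
        opt.step()
    return latent.detach(), pose.detach()
```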

    Analyzing Input and Output Representations for Speech-Driven Gesture Generation

    This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning. Comment: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencode
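    The two-step pipeline can be summarized in a short PyTorch sketch. The module names MotionE, MotionD, and SpeechE follow the paper, but the layer sizes and feature dimensions below are illustrative guesses, not the published configuration.

```python
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    """Step 1: learn a compact motion representation (trained as a denoising autoencoder,
    i.e. with noise added to the input poses during training)."""
    def __init__(self, pose_dim=45, latent_dim=40):
        super().__init__()
        self.MotionE = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.MotionD = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, pose_dim))

    def forward(self, pose):
        return self.MotionD(self.MotionE(pose))

class SpeechE(nn.Module):
    """Step 2: map speech features (e.g. MFCCs) to the learned motion representation."""
    def __init__(self, mfcc_dim=26, latent_dim=40):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mfcc_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

    def forward(self, speech_features):
        return self.net(speech_features)

def synthesize(speech_features, speech_encoder, motion_ae):
    """Test time: SpeechE predicts motion representations, MotionD decodes them to 3D poses."""
    return motion_ae.MotionD(speech_encoder(speech_features))
```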

    Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

    Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point across the generation process. By doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to a 2× increase in inference throughput and even greater memory savings.
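    A simplified view of the idea, not the paper's exact mechanism: a learned score marks each cached token as droppable, and a sparsity parameter controls how much of the key/value cache is discarded before the next attention step.

```python
import torch

def prune_kv_cache(keys, values, drop_scores, sparsity=0.8):
    """keys, values: (T, d) cached tensors; drop_scores: (T,) output of a learned scorer.
    Higher scores mean a token is judged less informative and can be dropped."""
    T = keys.shape[0]
    n_keep = max(1, int(round(T * (1.0 - sparsity))))
    # Keep the tokens with the lowest drop scores, preserving their original order.
    kept = torch.topk(drop_scores, n_keep, largest=False).indices.sort().values
    return keys[kept], values[kept]
```

    In the paper the dropping decision comes from a mechanism learned during fine-tuning; here it is abstracted into the precomputed drop_scores argument.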

    ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit

    Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works conduct music-to-dance synthesis by directly mapping music to human skeleton keypoints. Meanwhile, human choreographers design dance motions from music in a two-stage manner: they first devise multiple choreographic action units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody and emotion of the music. Inspired by this, we systematically study such a two-stage choreography approach and construct a dataset to incorporate such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework first devises a CAU prediction model to learn the mapping relationship between music and CAU sequences. Afterwards, we devise a spatial-temporal inpainting model to convert the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score). Comment: 10 pages, 5 figures, Accepted by ACM MM 2020.
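    The two-stage framework can be outlined as follows; cau_predictor and motion_inpainter are placeholder callables standing in for ChoreoNet's CAU prediction model and spatial-temporal inpainting model, not the released code.

```python
def music_to_dance(music_features, cau_predictor, motion_inpainter):
    """Schematic ChoreoNet-style pipeline with placeholder components."""
    # Stage 1: predict a sequence of choreographic action units (CAUs) from the music.
    cau_sequence = cau_predictor(music_features)
    # Stage 2: convert the discrete CAU sequence into continuous dance motion,
    # filling in the transitions between consecutive units (spatial-temporal inpainting).
    dance_motion = motion_inpainter(cau_sequence, music_features)
    return dance_motion
```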

    Structured Generative Models for Controllable Scene and 3D Content Synthesis

    Deep learning has fundamentally transformed the field of image synthesis, facilitated by the emergence of generative models that demonstrate remarkable ability to generate photorealistic imagery and intricate graphics. These models have advanced a wide range of industries, including art, gaming, movies, augmented & virtual reality (AR/VR), and advertising. While realism is undoubtedly a major contributor to their success, the ability to control these models is equally important in ensuring their practical viability and making them more useful for downstream applications. For instance, it is natural to describe an image through natural language, sketches, or attributes controlling the style of specific objects. Therefore, it is convenient to devise generative frameworks that follow a workflow similar to that of an artist. Furthermore, for interactive applications, the generated content needs to be visualized from various viewpoints while making sure that the identity of the scene is preserved and is consistent across multiple views. Addressing this issue is interesting not only from an application-oriented standpoint, but also from an image understanding perspective. Our visual system perceives 2D projections of 3D scenes, but the convolutional architectures commonly used in generative models ignore the concept of image formation and attempt to learn this structure from the data. Generative models that explicitly reason about 3D representations can provide disentangled control over shape, pose, and appearance, can better handle spatial phenomena such as occlusions, and can generalize with less data. These practical requirements motivate the need for generative models driven by structured representations that are efficient, easily interpretable, and more aligned with human perception.
    In this dissertation, we initially focus on the research question of controlling generative adversarial networks (GANs) for complex scene synthesis. We observe that, while existing approaches exhibit some degree of control over simple domains such as faces or centered objects, they fall short when it comes to complex scenes consisting of multiple objects. We therefore propose a weakly-supervised approach where generated images are described by a sparse scene layout (i.e. a sketch), and in which the style of individual objects can be refined through textual descriptions or attributes. We then show that this paradigm can effectively be used to generate complex images without trading off realism for control.
    Next, we address the aforementioned issue of view consistency. Following recent advances in differentiable rendering, we introduce a convolutional mesh generation paradigm that can be used to generate textured 3D meshes using GANs. This model can natively reason using 3D representations, and can therefore be used to generate 3D content for computer graphics applications. We also demonstrate that our 3D generator can be controlled using standard techniques that can also be applied to 2D GANs, and successfully condition our model on class labels, attributes, and textual descriptions. We then observe that methods for 3D content generation typically require ground-truth poses, restricting their applicability to simple datasets where these are available. We therefore propose a follow-up approach to relax this requirement, demonstrating our method on a larger set of classes from ImageNet.
    Finally, we draw inspiration from the literature on Neural Radiance Fields (NeRF) and incorporate this recently-proposed representation into our work on 3D generative modelling. We show how these models can be used to solve a series of downstream tasks such as single-view 3D reconstruction. To this end, we propose an approach that bridges NeRFs and GANs to reconstruct the 3D shape, appearance, and pose of an object from a single 2D image. Our approach adopts a bootstrapped GAN inversion strategy where an encoder produces a first guess of the solution, which is then refined through optimization by inverting a pre-trained 3D generator.
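    As a conceptual illustration of the first contribution (weakly-supervised scene synthesis from a sparse layout with per-object style control), the toy generator below conditions on a layout map, per-object style codes, and noise. All class names, layer choices, and dimensions are placeholders, not the dissertation's models.

```python
import torch
import torch.nn as nn

class LayoutConditionedGenerator(nn.Module):
    """Toy stand-in for a layout-and-style-conditioned GAN generator (illustrative only)."""

    def __init__(self, layout_channels=16, style_dim=64, noise_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(layout_channels + style_dim + noise_dim, 128, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 3, 3, padding=1),
            nn.Tanh(),
        )

    def forward(self, layout, object_styles, noise):
        # layout: (B, layout_channels, H, W) sparse semantic layout / sketch
        # object_styles: (B, style_dim, H, W) per-object style codes (e.g. derived from
        #   text or attributes) broadcast onto the regions they control
        # noise: (B, noise_dim, H, W) random code providing diversity
        return self.net(torch.cat([layout, object_styles, noise], dim=1))
```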